Case Study 4 - Time Series
Time Series Analysis of Seasonal Flu
Martin Garcia, Michael Catalano, Jeremy Otsap, Christian Nava
July 1, 2020
Introduction
The seasonal flu
We are going to use ARIMA to model the seasonal flu.
- Extract historical flu data (positive cases over time) – you can choose to model flu patterns at the national or regional/state-level
- Build an ARIMA model; determine the appropriate values for (p,d,q)
- How well does your model perform on validation data? (Note: you’ll need to create a training and validation set to measure forecast accuracy)
- Provide analysis to support your determinations
Weekly flu data are taken from the World Health Organization’s (WHO) FluNet database from October 04, 2010 through June 07, 2020. The data was cleaned and only 4 of the 22 variables were kept: Week, SDATE, SPEC_PROCESSED_NB, and ALL_INF. The latter three were renamed to wk_date.start, total_specimens.tested, and total_flu_cases.positive, respectively. An additional variable, total_flu_cases.percent_positive, was created, which is a percentage of the total specimens processed that tested positive for any strain of influenza. We will use this variable as our target time series variable.
FluNetReport <- FluNetReport %>%
select(Week, SDATE, SPEC_PROCESSED_NB, ALL_INF) %>%
rename(wk_date.start = SDATE,
total_specimens.tested = SPEC_PROCESSED_NB,
total_flu_cases.positive = ALL_INF)
# add column for percent positive cases
FluNetReport = mutate(FluNetReport, total_flu_cases.percent_positive = total_flu_cases.positive/total_specimens.tested*100)
# convert data type from string to date
FluNetReport$wk_date.start <- as.Date(FluNetReport$wk_date.start, "%m/%d/%y")# create time series of percent positive flu cases
ts_flu.percent_positive_cases <- ts(FluNetReport[ ,5])# get a smaller sample using only the most recent five years of data (most recent 260 weeks)
most_recent_5_years.flu_percent_positive_cases <- FluNetReport %>%
slice(tail(row_number(), 260))
# convert to time series object
ts_most_recent_5_years.flu_percent_positive_cases <- ts(most_recent_5_years.flu_percent_positive_cases[ ,5])
# plot the data
plotts.sample.wge(ts_most_recent_5_years.flu_percent_positive_cases)$autplt
[1] 1.000000000 0.976805102 0.924536826 0.853018652 0.767655386
[6] 0.670334051 0.562097838 0.446930491 0.328412250 0.210504886
[11] 0.097492674 -0.008848353 -0.106134687 -0.191755787 -0.265062059
[16] -0.327303640 -0.378929113 -0.420902512 -0.452465921 -0.474641016
[21] -0.489073342 -0.497986988 -0.503158896 -0.504427290 -0.501622425
[26] -0.498358721
$freq
[1] 0.003846154 0.007692308 0.011538462 0.015384615 0.019230769 0.023076923
[7] 0.026923077 0.030769231 0.034615385 0.038461538 0.042307692 0.046153846
[13] 0.050000000 0.053846154 0.057692308 0.061538462 0.065384615 0.069230769
[19] 0.073076923 0.076923077 0.080769231 0.084615385 0.088461538 0.092307692
[25] 0.096153846 0.100000000 0.103846154 0.107692308 0.111538462 0.115384615
[31] 0.119230769 0.123076923 0.126923077 0.130769231 0.134615385 0.138461538
[37] 0.142307692 0.146153846 0.150000000 0.153846154 0.157692308 0.161538462
[43] 0.165384615 0.169230769 0.173076923 0.176923077 0.180769231 0.184615385
[49] 0.188461538 0.192307692 0.196153846 0.200000000 0.203846154 0.207692308
[55] 0.211538462 0.215384615 0.219230769 0.223076923 0.226923077 0.230769231
[61] 0.234615385 0.238461538 0.242307692 0.246153846 0.250000000 0.253846154
[67] 0.257692308 0.261538462 0.265384615 0.269230769 0.273076923 0.276923077
[73] 0.280769231 0.284615385 0.288461538 0.292307692 0.296153846 0.300000000
[79] 0.303846154 0.307692308 0.311538462 0.315384615 0.319230769 0.323076923
[85] 0.326923077 0.330769231 0.334615385 0.338461538 0.342307692 0.346153846
[91] 0.350000000 0.353846154 0.357692308 0.361538462 0.365384615 0.369230769
[97] 0.373076923 0.376923077 0.380769231 0.384615385 0.388461538 0.392307692
[103] 0.396153846 0.400000000 0.403846154 0.407692308 0.411538462 0.415384615
[109] 0.419230769 0.423076923 0.426923077 0.430769231 0.434615385 0.438461538
[115] 0.442307692 0.446153846 0.450000000 0.453846154 0.457692308 0.461538462
[121] 0.465384615 0.469230769 0.473076923 0.476923077 0.480769231 0.484615385
[127] 0.488461538 0.492307692 0.496153846 0.500000000
$db
[1] 2.853836 3.371205 1.512745 -3.416015 19.406263 3.955619
[7] 6.264924 4.168738 1.003093 12.351847 -6.302650 4.925540
[13] -2.243765 -2.337566 -1.314618 -5.756434 -2.911888 -12.738661
[19] -5.425095 -7.707159 -3.076971 -12.364084 -15.015323 -9.374249
[25] -8.927485 -8.828385 -22.509648 -15.285883 -11.355253 -8.429517
[31] -16.035502 -16.029324 -12.703345 -11.390240 -8.461791 -10.434273
[37] -26.719742 -15.374167 -18.944671 -10.352591 -10.058433 -19.926264
[43] -19.652971 -25.808611 -12.083546 -10.620464 -28.086260 -28.758134
[49] -15.075307 -11.762048 -16.007209 -24.554443 -24.490342 -12.955775
[55] -18.559671 -26.147727 -27.438279 -22.311166 -20.626683 -16.591153
[61] -30.394407 -29.515385 -19.653743 -26.984700 -14.688882 -22.302582
[67] -25.104405 -27.881295 -23.379006 -19.065600 -24.021090 -21.477656
[73] -28.361080 -27.474549 -18.335920 -24.532227 -32.461248 -31.589200
[79] -30.190570 -22.779024 -28.008706 -23.449389 -33.046573 -25.891459
[85] -30.773434 -26.712263 -24.919340 -35.339262 -25.312662 -30.714813
[91] -35.838483 -27.000893 -26.529210 -27.730788 -24.268416 -41.431646
[97] -23.163481 -26.623638 -28.840145 -19.560907 -31.244290 -35.633536
[103] -26.591732 -33.486899 -22.012749 -28.718999 -26.980171 -31.778016
[109] -23.171507 -31.410986 -31.471537 -30.545238 -32.304405 -20.590284
[115] -31.923601 -28.602621 -33.050717 -32.016390 -23.014361 -27.260538
[121] -29.887522 -41.389545 -35.334682 -27.951645 -29.230879 -27.120772
[127] -26.224603 -38.840550 -26.487843 -32.438321
$dbz
[1] 9.9434515 10.0390873 10.1590193 10.2587707 10.2955923 10.2346669
[7] 10.0506316 9.7265870 9.2524265 8.6234421 7.8395454 6.9051289
[13] 5.8294341 4.6271537 3.3188019 1.9302183 0.4907195 -0.9697054
[19] -2.4221145 -3.8361120 -5.1718807 -6.3719957 -7.3665157 -8.1024670
[25] -8.5821543 -8.8702976 -9.0590782 -9.2262617 -9.4156196 -9.6387720
[31] -9.8871440 -10.1461994 -10.4070889 -10.6718834 -10.9510846 -11.2560475
[37] -11.5910644 -11.9489671 -12.3120308 -12.6580333 -12.9691050 -13.2388103
[43] -13.4731626 -13.6852307 -13.8874313 -14.0864596 -14.2832650 -14.4772358
[49] -14.6715698 -14.8763558 -15.1074307 -15.3816528 -15.7109127 -16.0971381
[55] -16.5296771 -16.9859320 -17.4359069 -17.8503412 -18.2097356 -18.5096296
[61] -18.7590391 -18.9736537 -19.1686016 -19.3545124 -19.5373696 -19.7203435
[67] -19.9054287 -20.0937435 -20.2848111 -20.4761004 -20.6640321 -20.8466030
[73] -21.0263837 -21.2119042 -21.4160771 -21.6519489 -21.9274333 -22.2410153
[79] -22.5800796 -22.9230590 -23.2456747 -23.5294218 -23.7679217 -23.9667776
[85] -24.1365835 -24.2834523 -24.4027031 -24.4792643 -24.4949989 -24.4393940
[91] -24.3172889 -24.1487619 -23.9620183 -23.7846098 -23.6376216 -23.5339740
[97] -23.4795086 -23.4749974 -23.5177365 -23.6021592 -23.7196336 -23.8581640
[103] -24.0029708 -24.1386616 -24.2528342 -24.3397436 -24.4020351 -24.4493389
[109] -24.4943521 -24.5484090 -24.6184940 -24.7065985 -24.8111606 -24.9295055
[115] -25.0598493 -25.2016687 -25.3540736 -25.5129240 -25.6682683 -25.8039182
[121] -25.9004379 -25.9412722 -25.9194873 -25.8413273 -25.7243973 -25.5918512
[127] -25.4661240 -25.3649210 -25.2999912 -25.2776917
plot(ts_most_recent_5_years.flu_percent_positive_cases,
main=c(paste("Weekly Incidence Rate of Flu in the United States"),
paste("from October 04, 2010 through June 07, 2020")),
xlab="Week",
ylab="Percent Positive Cases")(Need to talk about stationarity here and the conditions required for stationarity)
From a visual inspection it is difficult to determine if the mean is constant over time. If the mean were constant over time then every time period would have the same mean, i.e., the mean for December is the same for every . Below is a plot of mean positive rates for every week of the year. The plot clearly shows there are weekly differences.
by_week <- most_recent_5_years.flu_percent_positive_cases %>%
select(Week, total_flu_cases.percent_positive) %>%
group_by(Week)
mean_rate <- by_week %>% summarise(mean_flu_positive_rate = mean(total_flu_cases.percent_positive))
plot(mean_rate$Week,mean_rate$mean_flu_positive_rate, type = "o", main = "Mean Weekly Flu Positive Rate \n 15 June 2015 through 01 June 2020", xlab = "Week", ylab = "Mean Rate")The correlation between data points (covariance) only depends on how far apart they are in time and not where they are in time (i.e., contstant autocovariance). The ACF plots below split the data in half to see if the autocorrelations (autocovariance) change over time. Autocorrelations that change over time would imply a non-stationary time series. Comparing the first half of the data to the second half of the data shows the ACFs are nearly identical. This suggests the autocovariance of the data is constant over time.
# to compare the ACF structure of the first half of the data to the second half.
par(mfrow = c(1,2))
acf(ts_most_recent_5_years.flu_percent_positive_cases[1:130])
acf(ts_most_recent_5_years.flu_percent_positive_cases[131:260])There appears to be cyclic behavior in the data, which could imply seasonality and a non-stationary process. However, time series data with cyclic behavior and no trend or seasonality is considered stationary if the cycles are not of a fixed length. Per the realization plot above and Table 1 below, there does not appear to be a trend, and it is evident that the cycles (measuring from peak to peak) are not of a fixed length. The number of weeks that elapse between peaks for the time series varies between 41 and 63 weeks. Intuitively, this makes sense as the peak of the flu “season” doesn’t necessarily fall on the same week or month every year. Therefore, from a visual inspection of the plots, the data appear to come from a stationary process.
Table 1: Flu Season Peak Week
| Flu Season | Peak Week Start Date | Peak Week Number | # of Weeks Between Peaks | Positive Rate |
|---|---|---|---|---|
| 2010-2011 | January 31, 2011 | 5 | N/A | 35.49% |
| 2011-2012 | March 12, 2012 | 11 | 58 | 31.90% |
| 2012-2013 | December 24, 2012 | 52 | 41 | 38.18% |
| 2013-2014 | December 23, 2013 | 52 | 52 | 30.61% |
| 2014-2015 | December 22, 2014 | 52 | 52 | 32.37% |
| 2015-2016 | March 07, 2016 | 10 | 63 | 28.59% |
| 2016-2017 | February 20, 2017 | 8 | 50 | 28.17% |
| 2017-2018 | January 08, 2018 | 2 | 46 | 30.50% |
| 2018-2019 | February 25, 2019 | 9 | 59 | 29.58% |
| 2019-2020 | February 03, 2020 | 6 | 49 | 32.74% |
Cochrane-Orcutt Test to Check for Trend
To check if there is a trend, the Cochrane-Orcutt test is employed, which tests the hypothesis that there is no serial correlation in the residuals (i.e., no trend). The Cochrane-Orcutt test yields a p-value of 0.533 and fails to reject the null hypothesis, which suggests there is no trend.
# Check for trend using Cochrane-Orcutt
df <- most_recent_5_years.flu_percent_positive_cases[,c(2,5)]
x <- ts_most_recent_5_years.flu_percent_positive_cases
t = seq(1,260,1)
fit = lm(x~t, data = df)
# Cochrane-Orcutt test
cfit = cochrane.orcutt(fit)
summary(cfit)Call:
lm(formula = x ~ t, data = df)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 19.339934 17.627719 1.097 0.2736
t -0.055471 0.088742 -0.625 0.5325
Residual standard error: 1.9762 on 257 degrees of freedom
Multiple R-squared: 0.0015 , Adjusted R-squared: -0.0024
F-statistic: 0.4 on 1 and 257 DF, p-value: < 5.325e-01
Durbin-Watson statistic
(original): 0.04223 , p-value: 2.763e-57
(transformed): 0.58484 , p-value: 6.989e-31
Dickey-Fuller Test for Stationarity
Employing an augmented Dickey-Fuller test is a more formal approach to check for stationarity. An augmented Dickey-Fuller test helps determine if one or more seasonal factors should be included in the model and tests the null hypothesis that the autoregressive model has a root outside of the unit circle. The test depends on failing to reject the null hypothesis to decide whether there is a unit root present. However, failing to reject the null hypothesis is not evidence that a unit root (i.e., seasonal factor) exists.
A p-value > 0.5 for the augmented Dickey-Fuller test fails to reject the null hypothesis, which means that a unit root, or one or more seasonal factors, may be present. In the case of this data, the augmented Dickey-Fuller test rejects the null hypothesis with a p-value of 0.01, suggesting there are no seasonal factors present and validating the initial visual inspection of the data.
Per the initial visual inspection, the Cochrane-Orcutt test, and the Dickey-Fuller test, the data for the first model will be assumed to be stationary with no trend and no seasonal factors.
# Check for stationarity using the Dickey-Fuller test
adf.test(ts_most_recent_5_years.flu_percent_positive_cases)
Augmented Dickey-Fuller Test
data: ts_most_recent_5_years.flu_percent_positive_cases
Dickey-Fuller = -5.081, Lag order = 6, p-value = 0.01
alternative hypothesis: stationary
Model Selection Methodology
The Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) are used in this case study to select candidate models. The AIC and the BIC are measures that score a model based on its log-likelihood and complexity. The AIC aims to reduce the white noise variance in the model and penalizes models that add additional terms. The BIC is concurrently employed as the AIC tends to select higher order models, i.e., it may propose selecting an ARMA (2,2) over ARMA (1,1) model. The BIC imposes stronger penalties for increasing the orders of \(p\) and \(q\) and will tend to select models with fewer parameters. Lower values for AIC and BIC are preferred as they indicate
Candidate Model 1 - ARMA(4,1)
For this candidate model, we will assume that the realization is stationary. We use aic5.wge to identify the models with the lowest AIC and BIC. The AIC identifies a potential ARMA(3,2) model.
---------WORKING... PLEASE WAIT...
Five Smallest Values of aic
| p | q | aic | |
|---|---|---|---|
| 14 | 4 | 1 | 0.5787330 |
| 12 | 3 | 2 | 0.5818852 |
| 18 | 5 | 2 | 0.5984171 |
| 9 | 2 | 2 | 0.6088504 |
| 16 | 5 | 0 | 0.6135119 |
The BIC identifies a potential ARMA(4,1) model.
---------WORKING... PLEASE WAIT...
Five Smallest Values of bic
| p | q | bic | |
|---|---|---|---|
| 14 | 4 | 1 | 0.6609025 |
| 12 | 3 | 2 | 0.6640548 |
| 7 | 2 | 0 | 0.6755905 |
| 9 | 2 | 2 | 0.6773250 |
| 16 | 5 | 0 | 0.6956814 |
Both the AIC and BIS methods select an ARMA (4,1) model as the best structure for the data.
Estmimating the parameters of the model using \(p=4\) and \(q=1\) the model takes on the form \((1-2.55B+2.26^2-0.80B^3+0.11^4)X_t = (1-0.95B)a_t, \;\;\;\sigma_a^2 = 2.14\;\).
Coefficients of Original polynomial:
2.5568 -2.2625 0.8026 -0.1052
Factor Roots Abs Recip System Freq
1-1.9422B+0.9593B^2 1.0123+-0.1327i 0.9794 0.0208
1-0.6145B+0.1097B^2 2.8007+-1.1274i 0.3312 0.0609
A plot of the forecast for the last year of data, 52 weeks, suggests the model doesn’t do a very good job of predicitng the next year’s worth of data.
f = fore.aruma.wge(ts_most_recent_5_years.flu_percent_positive_cases,
phi=params$phi,
theta=params$theta,
n.ahead = 52,
lastn = TRUE,
plot=TRUE,
limits=TRUE)Performance Metric
We will use the average squared error (ASE) to measure the goodness of fit of the model (performance). The ASE measure takes the sum of the square of the difference between the forecasted, or predicted, value, \(\hat X_i\), and the actual value, \(X_i\), and takes the average for \(n\) number of observations. A lower ASE value indicates the model made fewer forecast errors.
\[ASE = \frac{\sum(\hat X_i - X_i)^2}{n}\]
The training dataset will be comprised of at least 130 weeks, or 2.5 years of data. The forecast horizon will be used as the validation set.
More explanation goes in here about how a rolling window ASE is better than a single ASE.
Ideally, for the rolling window ASE charts below, a low and steady ASE value (red line) as compared to the observed values (blue line) is preferred. This would indicate that the model did a good job of predicting most, if not all, the observed values. Spikes in the ASE value represent observed values that were not predicted well, areas of large error.
#Code from Prof. Sadler's Time Series Course Unit 7
#Model 1
phis = params$phi
thetas = params$theta
s = 0
d = 0
trainingSize = 130 # this is the window size (we used a window of 2.5 years or 130 weeks)
horizon = 2 # we forecast out 2 weeks
ASEHolder = numeric() # this is an empty varible that will hold all the ASE values
for( i in 1:(260-(trainingSize + horizon) + 1))
{
forecasts = fore.aruma.wge(ts_most_recent_5_years.flu_percent_positive_cases[i:(i+(trainingSize-1))],
phi = phis, theta = thetas,
s = s, d = d, n.ahead = horizon,plot=FALSE)
ASE = mean((ts_most_recent_5_years.flu_percent_positive_cases[(trainingSize+i):(trainingSize+ i + (horizon) - 1)] - forecasts$f)^2)
ASEHolder[i] = ASE
}
WindowedASE = mean(ASEHolder)
WindowedASE[1] 5.446276
[1] 0.7690026
# visualization of windowed ASE over time
newASE = c(rep(NA, 131), ASEHolder) # this plots ASE from week 131 onward
par(mar = c(5,5,2,5))
plot(ts_most_recent_5_years.flu_percent_positive_cases,type="l",ylab='Observation Value',xlab='Time',col="blue",main = 'Rolling Window ASE Over Time \n(2-week forecast)'
)
par(new = T)
plot(ASEHolder,type="l",lty=2,axes=F,ylab=NA,xlab=NA,col="red"
)
axis(side=4)
mtext(side=4, line=3, 'ASE')
legend("topleft",legend=c("Obs. Value","ASE"),lty=c(1,2),col=c("blue","red"),cex=.6
)#Code from Prof. Sadler's Time Series Course Unit 7
#Model 1
phis = params$phi
thetas = params$theta
s = 0
d = 0
trainingSize = 130 # this is the window size (we used a window of 2.5 years or 130 weeks)
horizon = 4 # we forecast out 1 months, or 4 weeks
ASEHolder = numeric() # this is an empty varible that will hold all the ASE values
for( i in 1:(260-(trainingSize + horizon) + 1))
{
forecasts = fore.aruma.wge(ts_most_recent_5_years.flu_percent_positive_cases[i:(i+(trainingSize-1))],
phi = phis, theta = thetas,
s = s, d = d, n.ahead = horizon,plot=FALSE)
ASE = mean((ts_most_recent_5_years.flu_percent_positive_cases[(trainingSize+i):(trainingSize+ i + (horizon) - 1)] - forecasts$f)^2)
ASEHolder[i] = ASE
}
WindowedASE = mean(ASEHolder)
WindowedASE[1] 12.69977
[1] 2.908173
# visualization of windowed ASE over time
#newASE = c(rep(NA, 131), ASEHolder) # this plots ASE from week 131 onward
par(mar = c(5,5,2,5))
plot(ts_most_recent_5_years.flu_percent_positive_cases,type="l",ylab='Observation Value',xlab='Time',col="blue",main = 'Rolling Window ASE Over Time \n(4-week forecast)'
)
par(new = T)
plot(ASEHolder,type="l",lty=2,axes=F,ylab=NA,xlab=NA,col="red"
)
axis(side=4)
mtext(side=4, line=3, 'ASE')
legend("topleft",legend=c("Obs. Value","ASE"),lty=c(1,2),col=c("blue","red"),cex=.6
)#Code from Prof. Sadler's Time Series Course Unit 7
#Model 1
phis = params$phi
thetas = params$theta
s = 0
d = 0
trainingSize = 130 # this is the window size (we used a window of 2.5 years or 130 weeks)
horizon = 12 # we forecast out 3 months, or 12 weeks
ASEHolder = numeric() # this is an empty varible that will hold all the ASE values
for( i in 1:(260-(trainingSize + horizon) + 1))
{
forecasts = fore.aruma.wge(ts_most_recent_5_years.flu_percent_positive_cases[i:(i+(trainingSize-1))],
phi = phis, theta = thetas,
s = s, d = d, n.ahead = horizon,plot=FALSE)
ASE = mean((ts_most_recent_5_years.flu_percent_positive_cases[(trainingSize+i):(trainingSize+ i + (horizon) - 1)] - forecasts$f)^2)
ASEHolder[i] = ASE
}
WindowedASE = mean(ASEHolder)
WindowedASE[1] 37.91318
[1] 21.63031
# visualization of windowed ASE over time
newASE = c(rep(NA, 131), ASEHolder) # this plots ASE from week 131 onward
par(mar = c(5,5,2,5))
plot(ts_most_recent_5_years.flu_percent_positive_cases,type="l",ylab='Observation Value',xlab='Time',col="blue",main = 'Rolling Window ASE Over Time \n(3-month forecast)'
)
par(new = T)
plot(ASEHolder,type="l",lty=2,axes=F,ylab=NA,xlab=NA,col="red"
)
axis(side=4)
mtext(side=4, line=3, 'ASE')
legend("topleft",legend=c("Obs. Value","ASE"),lty=c(1,2),col=c("blue","red"),cex=.6
)#Code from Prof. Sadler's Time Series Course Unit 7
#Model 1
phis = params$phi
thetas = params$theta
s = 0
d = 0
trainingSize = 130 # this is the window size (we used a window of 2.5 years or 130 weeks)
horizon = 26 # we forecast out 6 months, or 26 weeks
ASEHolder = numeric() # this is an empty varible that will hold all the ASE values
for( i in 1:(260-(trainingSize + horizon) + 1))
{
forecasts = fore.aruma.wge(ts_most_recent_5_years.flu_percent_positive_cases[i:(i+(trainingSize-1))],
phi = phis, theta = thetas,
s = s, d = d, n.ahead = horizon,plot=FALSE)
ASE = mean((ts_most_recent_5_years.flu_percent_positive_cases[(trainingSize+i):(trainingSize+ i + (horizon) - 1)] - forecasts$f)^2)
ASEHolder[i] = ASE
}
WindowedASE = mean(ASEHolder)
WindowedASE[1] 53.46084
[1] 45.45367
# visualization of windowed ASE over time
newASE = c(rep(NA, 131), ASEHolder) # this plots ASE from week 131 onward
par(mar = c(5,5,2,5))
plot(ts_most_recent_5_years.flu_percent_positive_cases,type="l",ylab='Observation Value',xlab='Time',col="blue",main = 'Rolling Window ASE Over Time \n(6-month forecast)'
)
par(new = T)
plot(ASEHolder,type="l",lty=2,axes=F,ylab=NA,xlab=NA,col="red"
)
axis(side=4)
mtext(side=4, line=3, 'ASE')
legend("topleft",legend=c("Obs. Value","ASE"),lty=c(1,2),col=c("blue","red"),cex=.6
)From the rolling window ASE charts above and Table 2 below the performance of the model deteriorates as the horizon increases.
section here about why median rolling window ASE and not single ASE, or mean rolling window ASE. Maybe use histogram of rolling window ASE values to make your point.
Table 2 below shows that the model does better at forecasting shorter horizons.
Table 2: Rolling Window ASE
| Forecast Horizon | Mean Rolling Window ASE | Median Rolling Window ASE |
|---|---|---|
| 2 weeks | 5.446 | 0.769 |
| 4 weeks | 12.700 | 2.908 |
| 3 months | 37.913 | 21.630 |
| 6 months | 53.461 | 45.454 |
Candidate Model 2 - ARIMA(3,2,1) s=52
The first candidate model didn’t do a great job in terms of forecasting longer horizons. Taking a closer look at ACF and spectral densities
There is no clear upward or downward trend in the data; however, seasonal trend is present as evidenced by the cyclic behavior of the realizations. Seasonality causes the time series to be non-stationary because the mean of the data is not constant. The mean value at a point in time may be different that the mean value at a different point in time within a particualr season/cycle (e.g., sales of frozen turkeys are every year around November and December than the rest of the year).
ACF and Spectral Density
The autocorrelations exhibit sinusoidal ACF behavior converging to zero, which is characteristic of complex conjugate roots and second order factors.
# plot the ACF and spectral densities
#invisible allows the plot to print, but supresses the output
invisible(acf(ts_most_recent_5_years.flu_percent_positive_cases, lag.max = 200))Additionaly, when looking at the spectral density, which helps identify the frequency content of a time series, there is a significant peak near 0 suggesting complex roots and seasonality in the data. In the first spectral density plot there is a large peak near zero. When the truncation point is changed to 20, this peak is at approimately 0.01923, or 1/52, which indicates a period of 52 weeks, or one year.
# plot the ACF and spectral densities
# invisible allows the plot to print, but supresses the output
par(mfrow = c(1,2))
invisible(parzen.wge(ts_most_recent_5_years.flu_percent_positive_cases))
invisible(parzen.wge(ts_most_recent_5_years.flu_percent_positive_cases,trunc = 20))These observations support the conclusion of non-stationarity and suggest we can take a seasonal difference to stationarize the data.
par(mfrow = c(1,2))
acf(ts_most_recent_5_years.flu_percent_positive_cases)
pacf(ts_most_recent_5_years.flu_percent_positive_cases)Overfit the data
Coefficients of Original polynomial:
0.0000 1.0000
Factor Roots Abs Recip System Freq
1+1.0000B -1.0000 1.0000 0.5000
1-1.0000B 1.0000 1.0000 0.0000
# Overfit the data
#est.ar.wge(ts_most_recent_5_years.flu_percent_positive_cases, p=55, type='burg')The only factor of \((1-B^52)\) that has the behavior we expected is \((1-1.9848B+B^2)\). This suggests that using \(Y_t = (1-1.9848B+B^2)X_t\) as the stationarizing transformation is appropriate.
A non-seasonal difference is taken to see if that transformation yields stationary data. After taking that seasonal difference the realizations still appear to have some pattern and most of the autocorrelations are outside the significance bands.
# take the first difference (1 - B)^1 of the data
d1 <- artrans.wge(ts_flu.percent_positive_cases, 1)Seasonality is suspected and makes sense for the data. Factors from overfit factor tables can be compared to the non-stationary factors \((1-B^s)\), where \(s\) is the seasonal term. If non-stationary factors match with the factors of \((1-B^s)\) well, then the data can be transformed by the \((1-B^s)\) factor.
The factor tables suggest there is an annual seasonal component (s=1).
Taking the second difference of the data yields a realization that looks like white noise and fewer lags outside the significance bands suggesting stationarized data.
# transform differenced data by (1 - B^1), i.e., take an annual seasonal difference (s=1)
d2 = artrans.wge(d1, 1) # here we're taking the differenced data, d1, and transforming it again---------WORKING... PLEASE WAIT...
Five Smallest Values of aic
p q aic
24 3 5 1.206270
23 3 4 1.214357
20 3 1 1.215555
50 8 1 1.216192
34 5 3 1.218166
---------WORKING... PLEASE WAIT...
Five Smallest Values of bic
p q bic
20 3 1 1.261137
5 0 4 1.270700
23 3 4 1.287288
24 3 5 1.288317
17 2 4 1.299023
BIC selects ARMA(3,1) model.
We then estimate the parameters of the differenced data with aic5.wge.
Coefficients of Original polynomial:
0.4768 0.2190 -0.0985
Factor Roots Abs Recip System Freq
1-0.5490B 1.8214 0.5490 0.0000
1+0.4612B -2.1683 0.4612 0.5000
1-0.3889B 2.5712 0.3889 0.0000
ARMA(\(p,\,q\)) models assume that the noise component, \(a_t\), of the model is white noise. If the residuals are not white noise, this suggests that further modeling may be necessary to better explain the behavior in the data. We employ a visual inspection and a more formal test to determine if the residuals are white noise.
Visual Inspection of Residuals
The residuals look random per the realization plot, suggesting white noise.
$autplt
[1] 1.000000000 0.012364334 0.022644699 -0.002517460 -0.016591296
[6] 0.136368562 -0.023855196 -0.035591531 -0.022353082 -0.077254235
[11] -0.036478329 -0.123718262 -0.062695833 -0.052554838 0.014033983
[16] 0.040132395 -0.003481309 -0.055316529 -0.005445382 -0.003324064
[21] 0.010787437 0.029855223 -0.014746613 0.084682828 -0.043904654
[26] 0.012874838
$freq
[1] 0.002217295 0.004434590 0.006651885 0.008869180 0.011086475 0.013303769
[7] 0.015521064 0.017738359 0.019955654 0.022172949 0.024390244 0.026607539
[13] 0.028824834 0.031042129 0.033259424 0.035476718 0.037694013 0.039911308
[19] 0.042128603 0.044345898 0.046563193 0.048780488 0.050997783 0.053215078
[25] 0.055432373 0.057649667 0.059866962 0.062084257 0.064301552 0.066518847
[31] 0.068736142 0.070953437 0.073170732 0.075388027 0.077605322 0.079822616
[37] 0.082039911 0.084257206 0.086474501 0.088691796 0.090909091 0.093126386
[43] 0.095343681 0.097560976 0.099778271 0.101995565 0.104212860 0.106430155
[49] 0.108647450 0.110864745 0.113082040 0.115299335 0.117516630 0.119733925
[55] 0.121951220 0.124168514 0.126385809 0.128603104 0.130820399 0.133037694
[61] 0.135254989 0.137472284 0.139689579 0.141906874 0.144124169 0.146341463
[67] 0.148558758 0.150776053 0.152993348 0.155210643 0.157427938 0.159645233
[73] 0.161862528 0.164079823 0.166297118 0.168514412 0.170731707 0.172949002
[79] 0.175166297 0.177383592 0.179600887 0.181818182 0.184035477 0.186252772
[85] 0.188470067 0.190687361 0.192904656 0.195121951 0.197339246 0.199556541
[91] 0.201773836 0.203991131 0.206208426 0.208425721 0.210643016 0.212860310
[97] 0.215077605 0.217294900 0.219512195 0.221729490 0.223946785 0.226164080
[103] 0.228381375 0.230598670 0.232815965 0.235033259 0.237250554 0.239467849
[109] 0.241685144 0.243902439 0.246119734 0.248337029 0.250554324 0.252771619
[115] 0.254988914 0.257206208 0.259423503 0.261640798 0.263858093 0.266075388
[121] 0.268292683 0.270509978 0.272727273 0.274944568 0.277161863 0.279379157
[127] 0.281596452 0.283813747 0.286031042 0.288248337 0.290465632 0.292682927
[133] 0.294900222 0.297117517 0.299334812 0.301552106 0.303769401 0.305986696
[139] 0.308203991 0.310421286 0.312638581 0.314855876 0.317073171 0.319290466
[145] 0.321507761 0.323725055 0.325942350 0.328159645 0.330376940 0.332594235
[151] 0.334811530 0.337028825 0.339246120 0.341463415 0.343680710 0.345898004
[157] 0.348115299 0.350332594 0.352549889 0.354767184 0.356984479 0.359201774
[163] 0.361419069 0.363636364 0.365853659 0.368070953 0.370288248 0.372505543
[169] 0.374722838 0.376940133 0.379157428 0.381374723 0.383592018 0.385809313
[175] 0.388026608 0.390243902 0.392461197 0.394678492 0.396895787 0.399113082
[181] 0.401330377 0.403547672 0.405764967 0.407982262 0.410199557 0.412416851
[187] 0.414634146 0.416851441 0.419068736 0.421286031 0.423503326 0.425720621
[193] 0.427937916 0.430155211 0.432372506 0.434589800 0.436807095 0.439024390
[199] 0.441241685 0.443458980 0.445676275 0.447893570 0.450110865 0.452328160
[205] 0.454545455 0.456762749 0.458980044 0.461197339 0.463414634 0.465631929
[211] 0.467849224 0.470066519 0.472283814 0.474501109 0.476718404 0.478935698
[217] 0.481152993 0.483370288 0.485587583 0.487804878 0.490022173 0.492239468
[223] 0.494456763 0.496674058 0.498891353
$db
[1] 2.63065455 -2.65241936 -6.70016751 -0.57155179 -0.34814824
[6] -3.52668810 -8.12534582 -11.01568211 -6.83942551 -4.32221653
[11] 2.38292220 0.67723763 4.39289268 4.21667036 3.07788119
[16] 1.29704652 -6.48898073 0.77558262 -2.30901417 6.05538490
[21] 4.96870487 4.36445550 -7.51349202 4.27321754 -1.08727017
[26] -28.03519758 -3.66968882 3.28612185 2.66982487 -7.25737891
[31] -0.26341085 1.45280999 -0.24308183 -1.34464609 -5.31965376
[36] -1.96081949 4.23884276 -6.91032206 -2.96938380 -10.38054263
[41] -0.33943613 -36.45438866 -10.62432851 -12.51540540 -5.20595752
[46] -1.69632506 -5.11308708 3.56435378 -13.03298220 -3.58851983
[51] -12.97629185 -6.88653409 -11.47883143 -1.44502690 -11.68822608
[56] 6.45615251 -6.71341307 -2.16773325 -5.28834876 -1.92658063
[61] -14.67157387 -7.74728833 7.47637215 -1.56702866 5.98648954
[66] -3.75943188 0.57362682 -10.52316048 -13.10429110 -2.55424635
[71] 0.08097251 5.22548628 -7.80712973 -5.77331613 -2.00792890
[76] -8.40345371 -8.64044738 -7.61105525 -2.42739733 5.51468103
[81] -2.23262205 2.76776434 -12.93041625 2.22238743 -2.18480996
[86] -5.62560338 -2.19045381 -0.66747595 1.71825331 4.56340883
[91] 6.51653567 -10.85214694 5.35625316 -1.09142991 -4.42831048
[96] -8.28325669 -5.18063912 -7.47904988 6.68186203 2.34180105
[101] -5.73640967 -7.23303446 0.13361678 -1.64024914 -0.41147884
[106] 4.83176988 -3.00230406 -2.45382965 2.07339219 -2.65396693
[111] -11.48764529 -13.11401712 -1.58749065 -5.05299736 -1.72720877
[116] 0.19455676 -1.96796013 5.48105630 4.09364378 -9.70141855
[121] -7.12515721 -27.53156635 -0.94045656 -7.64523815 -1.25429926
[126] 1.65399481 -6.21441221 0.04721608 -24.63988814 -3.52857820
[131] 0.84705382 1.85084347 -2.14753763 -3.88947421 -6.39087486
[136] -12.09674125 -1.06566735 -5.58206737 -13.47117413 -13.96774202
[141] 4.85392971 2.49302745 -2.13807962 -8.42825004 -3.94536190
[146] -3.71425511 -15.08592493 -8.49243193 -0.68668717 1.67816435
[151] -5.21762105 3.21560648 0.49150051 3.32446562 -8.74561118
[156] -15.21931542 -4.42013554 -1.79361764 -2.15239243 1.14831216
[161] 6.21496497 -5.93462855 -39.75614744 -9.12991040 -12.44591252
[166] -16.74408444 -9.69576940 7.51129002 -0.11562729 -7.23637459
[171] 1.05388313 -0.92235036 -5.78224722 -8.71114388 -3.23003612
[176] 2.88088465 7.56335665 1.51406995 -2.85473618 7.47910888
[181] -3.86947974 -6.81828892 -4.71836733 -0.16607783 -1.64748071
[186] 0.67346988 2.65810413 1.43693964 1.95644872 -15.42745508
[191] -7.57379808 -1.29669993 1.74238618 5.15129067 4.53556959
[196] -2.06541207 3.45925927 -1.38656672 -6.56167997 -11.12640355
[201] -2.26929151 -1.46160339 -6.43093509 2.08933407 0.33812955
[206] -19.42804806 -1.78110250 -15.75263284 -31.56954378 -9.86260020
[211] -13.37199742 0.94158691 5.94643395 -3.26174306 1.78880418
[216] -5.64106698 -14.15005027 -0.96541101 -8.37093189 -3.63783209
[221] 1.97098357 2.67959742 3.42330407 -6.14772827 -17.37247807
$dbz
[1] -1.492253045 -1.435021054 -1.337168653 -1.196458327 -1.012172588
[6] -0.786483913 -0.525106032 -0.237011975 0.066619577 0.373813884
[11] 0.673128314 0.954618474 1.210361057 1.434592965 1.623583480
[16] 1.775353771 1.889330007 1.965986072 2.006509326 2.012509314
[21] 1.985782183 1.928139381 1.841305370 1.726884025 1.586387123
[26] 1.421311595 1.233246410 1.023986173 0.795627836 0.550629531
[31] 0.291816929 0.022332566 -0.254463220 -0.535116358 -0.816191653
[36] -1.094330616 -1.366215278 -1.628416490 -1.877136122 -2.107892217
[41] -2.315237292 -2.492632564 -2.632610669 -2.727327286 -2.769510911
[46] -2.753673997 -2.677296503 -2.541623174 -2.351802107 -2.116317542
[51] -1.845916392 -1.552361958 -1.247322528 -0.941572410 -0.644540375
[56] -0.364145525 -0.106824488 0.122341381 0.319467482 0.481720001
[61] 0.607206578 0.694914300 0.744697449 0.757312207 0.734491282
[66] 0.679046355 0.594978897 0.487569608 0.363404102 0.230280904
[71] 0.096944329 -0.027400262 -0.133788360 -0.214344289 -0.263125712
[76] -0.276768164 -0.254756519 -0.199250728 -0.114529416 -0.006220845
[81] 0.119479122 0.256446472 0.399050817 0.542360220 0.682177438
[86] 0.814972893 0.937774411 1.048057439 1.143660357 1.222733032
[91] 1.283715037 1.325333811 1.346611697 1.346873219 1.325748693
[96] 1.283175818 1.219406077 1.135026344 1.031007044 0.908785618
[101] 0.770387122 0.618572116 0.456985648 0.290261575 0.124017781
[106] -0.035331766 -0.181010158 -0.306571233 -0.406785263 -0.478548139
[111] -0.521577887 -0.538679197 -0.535472512 -0.519650846 -0.499960734
[116] -0.485144092 -0.483027211 -0.499849842 -0.539843318 -0.605019051
[121] -0.695118655 -0.807691610 -0.938290837 -1.080797960 -1.227898146
[126] -1.371711304 -1.504549434 -1.619716789 -1.712221571 -1.779253855
[131] -1.820323520 -1.837036420 -1.832581590 -1.811065180 -1.776836870
[136] -1.733920142 -1.685605354 -1.634218725 -1.581052881 -1.526434243
[141] -1.469901625 -1.410470484 -1.346953271 -1.278297849 -1.203896971
[146] -1.123818720 -1.038916120 -0.950794806 -0.861645904 -0.773978147
[151] -0.690300384 -0.612809048 -0.543126278 -0.482118454 -0.429807206
[156] -0.385369444 -0.347211271 -0.313093479 -0.280283975 -0.245715863
[161] -0.206138967 -0.158266178 -0.098930035 -0.025273242 0.065007222
[166] 0.173362227 0.300069899 0.443954523 0.602222281 0.770478556
[171] 0.942952433 1.112903678 1.273147440 1.416616896 1.536895246
[176] 1.628675173 1.688131965 1.713215011 1.703866279 1.662163444
[181] 1.592362278 1.500784047 1.395470221 1.285527737 1.180135716
[186] 1.087290415 1.012505969 0.957794359 0.921234116 0.897273368
[191] 0.877671413 0.852796325 0.812954172 0.749516204 0.655752860
[196] 0.527399876 0.363037622 0.164362873 -0.063611221 -0.312461542
[201] -0.570486663 -0.823256397 -1.054784575 -1.249409998 -1.394177898
[206] -1.481139934 -1.508787392 -1.482046460 -1.410850720 -1.307884876
[211] -1.186292429 -1.057925912 -0.932342258 -0.816455618 -0.714638830
[216] -0.629069385 -0.560172739 -0.507075933 -0.468027927 -0.440768662
[221] -0.422841438 -0.411848745 -0.405654043 -0.402533443 -0.401282324
The plot of the sample autocorrelations shows lag 6 and lag 9 outside the confidence bands. However, at a 95% confidence level we would expect 1 out of every 20 lags to be outside the bands. Therefore, per a visual inspection, the residuals appear to be white noise.
Ljung-Box Test for residuals
A visual inspection looks at each autocorrelation separately. The Ljung-Box test is used to test the autocorrelations as a group to determine if the residuals are white noise. It tests the null hypothesis (\(H_0\)) that all autocorrelations (\(\rho\)) are zero (i.e., the residuals are white noise). \[ H_0: \rho_1 = \rho_2 = ... = \rho_K = 0\]
If at least one autocorrelation is not zero, then white noise is not present. \[H_a: at\;least\;one\; \rho_k \neq 0, \, for\,\, 1 \leq k \leq K\] In the tswge package, the residuals are found in the output variable $res. These are calculated within the functions est.ar.wge and est.arma.wge.
The Ljung-Box test yeilds \(p > 0.05\), which fails to reject the null hypothesis and suggests the residuals are white noise.
Obs 0.01236433 0.0226447 -0.00251746 -0.0165913 0.1363686 -0.0238552 -0.03559153 -0.02235308 -0.07725423 -0.03647833 -0.1237183 -0.06269583 -0.05255484 0.01403398 0.0401324 -0.003481309 -0.05531653 -0.005445382 -0.003324064 0.01078744 0.02985522 -0.01474661 0.08468283 -0.04390465
$test
[1] "Ljung-Box test"
$K
[1] 24
$chi.square
[1] 30.86419
$df
[1] 20
$pval
[1] 0.05700908
As a second check, we use a different K-value, which yields \(p > 0.05\) also suggesting the residuals are white noise.
# second check with different K-value
ljung.wge(estimated_parameters.d2_s52$res, p = 3, q = 1, K = 48) # pval is > 0.05 and we fail to reject the null hypothesisObs 0.01236433 0.0226447 -0.00251746 -0.0165913 0.1363686 -0.0238552 -0.03559153 -0.02235308 -0.07725423 -0.03647833 -0.1237183 -0.06269583 -0.05255484 0.01403398 0.0401324 -0.003481309 -0.05531653 -0.005445382 -0.003324064 0.01078744 0.02985522 -0.01474661 0.08468283 -0.04390465 0.01287484 0.01242087 0.009543115 0.04261902 -0.03115811 0.06402467 -0.005110867 0.01372502 0.03931109 0.01792378 0.08116198 -0.007698405 -0.004988617 0.1035932 0.03717967 0.03297893 0.03652906 0.05310496 -0.01074421 -0.01902018 0.05832936 0.01690418 -0.0882948 0.01377552
$test
[1] "Ljung-Box test"
$K
[1] 48
$chi.square
[1] 53.43477
$df
[1] 44
$pval
[1] 0.155851
Per a visual inspection and the Ljung-Box test, the residuals are white noise.
#Model 2
phis = params$phi
thetas = params$theta
s = 52
d = 2
trainingSize = 130 # this is the window size (we used a window of 2.5 years or 130 weeks)
horizon = 2 # we forecast out 2 weeks
ASEHolder = numeric() # this is an empty varible that will hold all the ASE values
for( i in 1:(260-(trainingSize + horizon) + 1))
{
forecasts = fore.aruma.wge(ts_most_recent_5_years.flu_percent_positive_cases[i:(i+(trainingSize-1))],
phi = phis, theta = thetas,
s = s, d = d, n.ahead = horizon,plot=FALSE)
ASE = mean((ts_most_recent_5_years.flu_percent_positive_cases[(trainingSize+i):(trainingSize+ i + (horizon) - 1)] - forecasts$f)^2)
ASEHolder[i] = ASE
}
WindowedASE = mean(ASEHolder)
WindowedASE[1] 73.19725
[1] 23.77057
# visualization of windowed ASE over time
newASE = c(rep(NA, 131), ASEHolder) # this plots ASE from week 131 onward
par(mar = c(5,5,2,5))
plot(ts_most_recent_5_years.flu_percent_positive_cases,type="l",ylab='Observation Value',xlab='Time',col="blue",main = 'Rolling Window ASE Over Time \n(2-week forecast)'
)
par(new = T)
plot(ASEHolder,type="l",lty=2,axes=F,ylab=NA,xlab=NA,col="red"
)
axis(side=4)
mtext(side=4, line=3, 'ASE')
legend("topleft",legend=c("Obs. Value","ASE"),lty=c(1,2),col=c("blue","red"),cex=.6
)Conclusion
When fitting an ARMA model to a set of data, the goal is to explain as much of the variability in the data as is reasonably possible.